[00:05.790 --> 00:11.750]  Hi, and today with Younghoo Lee, I'll be presenting Detecting Handcrafted Social
[00:11.750 --> 00:15.510]  Engineering Emails with a Bleeding Edge Neural Language Model.
[00:19.940 --> 00:22.700]  A bit about us before getting into the talk.
[00:22.700 --> 00:28.680]  Sophos is a security products and services company. We have a whole range of security
[00:28.680 --> 00:35.020]  products ranging from firewalls, to mobile security, to email security, to endpoint security.
[00:35.020 --> 00:43.040]  And the team that I manage, Sophos AI, is the AI shop inside of Sophos. We research and develop
[00:43.040 --> 00:47.000]  the company's machine learning technology. And then we're also responsible...
[00:50.340 --> 00:56.180]  Younghoo Lee is one of the star researchers
[00:56.180 --> 01:03.460]  on my team. He's almost completely solely responsible for the machine learning subsystem
[01:03.460 --> 01:12.180]  that protects our Android customers, at least responsible for the R&D part of that work.
[01:12.180 --> 01:16.000]  He's also the principal author of the work presented today. I'm presenting in a supporting
[01:16.000 --> 01:20.180]  role here. I'll mostly just be setting the stage for the original work that Younghoo is gonna
[01:20.180 --> 01:30.040]  describe. So, the problem we're solving. Most generally, we're working on the problem of
[01:30.040 --> 01:36.520]  detecting phishing emails. More specifically, we're working on detecting business email
[01:36.520 --> 01:42.380]  compromise phishing emails and targeted phishing emails. So, to understand what this is, it's
[01:42.380 --> 01:49.100]  useful to say what it's not. We're not focused on detecting mass campaigns where detection
[01:49.100 --> 01:54.340]  reduces to a near-duplicate detection problem because attackers are just sending out millions
[01:54.340 --> 02:00.020]  of copies of basically the same phishing email. Detecting those types of emails turns out to be
[02:00.200 --> 02:07.380]  a side effect of our focus. But our real focus is on detecting new, custom-authored, bespoke
[02:07.380 --> 02:13.200]  phishing emails that are based on research on a target. And this diagram does a good job of
[02:13.200 --> 02:18.840]  getting across the kind of workflow we're looking at stopping. In this workflow, cyber criminals and
[02:18.840 --> 02:27.080]  attackers identify targets through open-source research, usually on the web. Then they establish
[02:27.080 --> 02:34.980]  contact with those targets in step two, which we call grooming. In this case, they put out a lure,
[02:34.980 --> 02:41.280]  usually an initial email, or sometimes an initial text message, and then build trust and authenticity
[02:41.280 --> 02:49.500]  around some identity that they're impersonating with the mark. In step three, they cash out the
[02:49.500 --> 02:54.860]  trust that they've built with their targets and make an ask of those targets. Oftentimes, that
[02:54.860 --> 02:59.500]  will be around wiring money or sending credentials. And then in step four, they actually
[03:01.820 --> 03:04.050]  receive money or receive the credentials.
[03:07.020 --> 03:11.540]  So within targeted phishing, business email compromise, which is focused on
[03:11.540 --> 03:17.160]  stealing money from businesses and other organizations, has been a growing trend.
[03:17.160 --> 03:23.220]  So as you can see here, in July 2016, a few billion dollars were stolen, according to the FBI,
[03:23.220 --> 03:27.720]  through business email compromise attacks. These are targeted phishing attacks that extort money
[03:27.720 --> 03:35.020]  from organizations. In May 2017, that number had grown to almost six billion. And in 2018,
[03:35.020 --> 03:39.140]  that number had grown to more than 12 billion. I don't have data for the last two years, but
[03:39.140 --> 03:44.940]  I see no reason to believe that this trend has attenuated. I think it's likely it's continued
[03:44.940 --> 03:50.460]  on a similar trajectory. This is a big problem. We see this in Sophos's customer base,
[03:50.460 --> 03:55.520]  which is a pretty large sample. We also hear about it from other folks in the
[03:55.520 --> 03:59.940]  cybersecurity space. And so we're very focused on it because it's affecting people. And it's
[03:59.940 --> 04:04.220]  not just affecting large organizations. You can see on the axis on the right
[04:04.220 --> 04:12.580]  that something like 80,000 organizations had been hit by July 2018 by these attacks.
[04:13.900 --> 04:17.860]  So there are lots of small and mid-sized organizations getting hit. And oftentimes,
[04:17.860 --> 04:23.100]  financial damage can be in the hundreds of thousands of dollars and really impact people's
[04:23.100 --> 04:29.240]  lives when these attacks happen. So again, just to reiterate, we're focused primarily on
[04:29.240 --> 04:33.220]  these business email compromise use cases, but also more generally on targeted phishing in which
[04:34.600 --> 04:38.920]  a lot of manual labor goes into the phishing process on the criminal actors' part.
[04:41.520 --> 04:45.560]  And I think it almost goes without saying, but we're focused on step two and three
[04:45.560 --> 04:51.300]  of these criminal actors' workflow, the steps that are mediated over email. So we see lots of
[04:51.300 --> 04:58.840]  malicious emails exchanged in the grooming step. And then we also see, obviously, a malicious email
[04:58.840 --> 05:02.440]  transmitted in the exchange of information step where the attacker makes the ask of the
[05:03.040 --> 05:10.920]  target, usually an employee. And this is just because we're focused on email as our signal.
[05:11.860 --> 05:14.500]  And that's the scope of the work that I'll be talking about today.
[05:16.180 --> 05:21.760]  Just to flesh this out a little bit, here's an example phishing email sent in the later
[05:24.060 --> 05:30.900]  epochs of the grooming stage of an attacker's workflow. Here, the attacker has established
[05:30.900 --> 05:38.040]  themselves as an impersonated chancellor of UC Berkeley. They are emailing an employee at UC
[05:38.040 --> 05:42.640]  Berkeley, asking if they're available, looking to exchange messages with them, probably about to
[05:42.640 --> 05:51.510]  make an ask around a money transfer. Now, I think it's important to highlight why phishing
[05:51.510 --> 05:59.570]  detection is hard. And why, in particular, detecting new, previously unseen phishing emails written as
[05:59.570 --> 06:07.590]  part of a manual phishing campaign, like what I've just described, is hard. And what this boils down
[06:07.590 --> 06:13.870]  to, I think, is that classical natural language processing problems are hard. It's hard to
[06:13.870 --> 06:18.990]  get computers to understand language in any meaningful way and reason about language in any
[06:18.990 --> 06:26.530]  meaningful way. And detecting phishing emails really boils down to building models that have
[06:26.530 --> 06:34.670]  some level of understanding of language. So to understand why algorithmically it's hard to make
[06:34.670 --> 06:38.910]  sense of language, let's look at a few classical natural language processing problems. So one of
[06:38.910 --> 06:46.470]  those is coreference resolution. To get a sense of what this problem means, consider the following
[06:46.470 --> 06:51.510]  sentence: "I went to the store for some milk, and based on the price, decided to buy it."
[06:52.470 --> 06:57.950]  Now, from a grammatical perspective, "it" could refer to the store or to the milk, right? It's
[06:57.950 --> 07:01.470]  possible that I went to the store for some milk, and based on the price of the store, I decided to
[07:01.470 --> 07:09.030]  buy it. But it's more likely that I bought the milk. As humans, in solving this coreference
[07:09.030 --> 07:17.430]  resolution problem, so resolving what "it" refers to, we have to plumb the depths of a number of
[07:17.430 --> 07:23.950]  complex mental models, right? We need to deploy our syntactical model of the English language,
[07:23.950 --> 07:27.850]  our semantic model of the English language, and our model of the world in which a person is more likely
[07:27.850 --> 07:32.930]  to have bought milk at the store than have bought the store itself. And that's how we solve this
[07:32.930 --> 07:39.530]  problem. Hopefully, it's clear that it's hard to get algorithms to do that. But to detect phishing
[07:39.530 --> 07:44.890]  emails, we really need to understand language. And this is a problem that's constituent of the
[07:44.890 --> 07:49.030]  problem of understanding language. Word polysemy is also a classic problem
[07:49.030 --> 07:53.850]  in algorithmic understanding of language. So take a sentence like, "He drank a lot and was quite the
[07:53.850 --> 08:00.830]  rake." Grammatically, it's valid to interpret this sentence as meaning that he drank a lot
[08:00.830 --> 08:08.590]  and was quite the garden tool used to rake leaves off your lawn. That's clearly
[08:08.590 --> 08:13.070]  not the sense in which the word "rake" is being used. The word "rake" is being used in the
[08:13.070 --> 08:20.970]  sense of a drunk, semi-criminal, sort of dissolute individual here. But it takes a pretty deep
[08:20.970 --> 08:27.870]  exercise of a human being's mental models to arrive at the sense in which this word was used.
[08:27.870 --> 08:33.150]  And that's not easy to reproduce in the form of an automated agent, either machine learning or
[08:33.150 --> 08:38.430]  based on regexes and rules. Sentiment detection is another hard problem in natural language
[08:38.430 --> 08:43.450]  processing. So consider the sentence, "I'm not angry at all. No, of course. Why would I be angry
[08:43.450 --> 08:49.130]  that you spent our life savings on your mistress?" So clearly, the speaker is angry here. And they're
[08:49.130 --> 08:54.630]  being sarcastic. Detecting that they're angry and sarcastic is not trivial and requires that
[08:54.630 --> 09:00.010]  we understand not only the syntactic structure of the sentence, but also the semantics of the
[09:00.010 --> 09:05.470]  sentence. And requires that we have the reflex that this person is probably angry if their
[09:05.470 --> 09:15.050]  interlocutor spent their life savings on their mistress. Okay, so to solve phishing means that
[09:15.050 --> 09:19.330]  we need algorithms that can make sense of language. Making sense of language is hard, as these three
[09:19.330 --> 09:28.610]  problems demonstrate. So a good solution to the phishing problem would, as intermediate
[09:28.610 --> 09:34.150]  steps to detecting that an email is a phishing email, at some level be able to solve these
[09:34.150 --> 09:37.930]  problems somewhere in the depths of its intermediate representation of the language that
[09:37.930 --> 09:42.410]  it's looking at. The other challenge we have, obviously, in any cybersecurity problem, at least
[09:42.410 --> 09:46.350]  any detection context in cybersecurity, is that we have adversaries who'd like to bypass our
[09:46.350 --> 09:51.490]  detection. And that's also worth considering. So these are all reasons why the problem that we're
[09:51.490 --> 09:56.190]  presenting here is a hard one and, since we haven't solved it completely, deserves attention
[09:56.190 --> 10:05.110]  from our community. Now, the approach that we're using to attack the phishing problem
[10:05.610 --> 10:11.490]  is based in neural networks and deep learning, and a specific advance that happened in the last
[10:11.490 --> 10:19.710]  few years known as transformers. So transformers, or more specifically, transformer blocks, are a
[10:19.710 --> 10:26.950]  new kind of construct in neural networks, much like convolutions were new, I think, as of the
[10:26.950 --> 10:31.730]  90s or late 80s, and backpropagation was a new idea in neural networks, I think,
[10:31.730 --> 10:36.430]  starting in the 80s. Transformers are a new idea that's come out in recent years,
[10:37.290 --> 10:44.370]  and they help model language with a depth and fidelity that seems to be genuinely new and
[10:44.370 --> 10:49.090]  represent a step function in our ability to model language. So they're very exciting. The big idea
[10:49.090 --> 10:52.670]  behind the work that we're presenting today is that we're taking transformers and applying them
[10:52.670 --> 10:57.010]  to a cybersecurity problem, which we haven't seen much of before. So I'm going to talk a bit about
[10:57.010 --> 11:02.270]  what transformers are. A detailed discussion of how they work is beyond the scope of this talk,
[11:02.270 --> 11:09.130]  but I'm going to give some intuition. And then I'll pass the mic metaphorically over to
[11:09.130 --> 11:19.540]  Younghoo, who will present on how we're using transformers. Okay, so here's an example,
[11:19.540 --> 11:26.940]  which I think helps to illustrate some ways in which transformers represent a real breakthrough
[11:26.940 --> 11:34.580]  in machine comprehension of language. So in this example, and I expect some of you have seen this
[11:34.580 --> 11:42.720]  because this went viral last year, a researcher wrote a prompt. This prompt is given at the top
[11:42.720 --> 11:50.360]  here, the "In a shocking finding" paragraph. And then a transformer model just took it from there
[11:50.360 --> 11:58.740]  and wrote a story based on that prompt. And I think when you see the story that the machine wrote,
[11:58.740 --> 12:06.680]  you'll see that it betrays an understanding of the syntax and semantics of language
[12:07.760 --> 12:12.560]  that's quite striking. So I think it's worth reading this out loud. The human-written prompt
[12:12.560 --> 12:17.860]  reads, "In a shocking finding, scientists discovered a herd of unicorns living in a remote,
[12:17.860 --> 12:22.360]  previously unexplored valley in the Andes Mountains. Even more surprising to the researchers
[12:22.360 --> 12:27.840]  was the fact that the unicorns spoke perfect English." And now again, the challenge to the
[12:27.840 --> 12:33.440]  model is to pick up where the prompt left off and write something coherent. And if it can, that reflects something around its ability
[12:33.440 --> 12:40.740]  to solve problems like the fundamental problems I described earlier. So the neural network
[12:40.740 --> 12:45.980]  continues: "The scientists named the population after their distinctive horn, Ovid's unicorn.
[12:45.980 --> 12:50.180]  These four-horned, silver-white unicorns were previously unknown to science.
[12:50.180 --> 12:55.380]  Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved.
[12:55.380 --> 13:00.440]  Dr. Jorge Perez, an evolutionary biologist from the University of La Paz and several companions,
[13:00.440 --> 13:05.460]  were exploring the Andes Mountains when they found a small valley with no other animals or humans.
[13:05.860 --> 13:09.860]  Perez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks
[13:09.860 --> 13:15.360]  of rock and silver snow. Perez and the others then ventured further into the valley. 'By the time we
[13:15.360 --> 13:19.700]  reached the top of one peak, the water looked blue with some crystals on top,' said Perez."
[13:19.700 --> 13:27.280]  And the essay continues. But what interested Younghoo and me in seeing examples like this of what
[13:27.280 --> 13:32.720]  transformers can do is that examples like these
[13:32.720 --> 13:37.680]  suggest that transformers have understood the syntax and semantics of the English language
[13:37.680 --> 13:45.220]  to some degree, which raises the question: could the representations that the transformers
[13:45.220 --> 13:52.140]  learn in their parameter structure be useful in detecting phishing emails, since these
[13:52.140 --> 13:59.080]  transformer models seem to have distilled knowledge of how language works and what it means
[14:00.600 --> 14:06.220]  within their parameter settings. Okay, so this is one impressive example
[14:06.220 --> 14:09.960]  of what transformers can do, and it raises the question of whether or not they could be
[14:09.960 --> 14:14.580]  applied usefully to phishing. And we'll show results from our experiments later in Younghoo's
[14:14.580 --> 14:23.840]  section. Here's another example of just how marvelous transformer models are with respect to
[14:26.380 --> 14:32.360]  how much more sophisticated they seem to be. So here a researcher typed into this box
[14:33.100 --> 14:36.660]  a description of some HTML that he would like the transformer model to generate.
[14:36.660 --> 14:42.200]  He writes, "a button with the color of Donald Trump's hair." And the model actually writes,
[14:42.820 --> 14:49.040]  as a text completion, some HTML and CSS. It seems to understand that Donald Trump has yellow hair,
[14:49.040 --> 14:57.000]  and it makes a button, and it writes valid HTML and CSS. Which is, I think, just
[14:58.720 --> 15:02.820]  impressive in the absolute sense, but also for folks who've been in the NLP space for a while,
[15:02.820 --> 15:07.620]  it's impressive relatively. Five, ten years ago, if you had shown somebody that we could do
[15:07.620 --> 15:13.820]  this in 2020, I think they would have been a bit incredulous.
[15:13.960 --> 15:18.280]  Applications like these represent a big step forward in natural language processing. And
[15:18.280 --> 15:24.640]  they're due, again, to this new idea called a transformer block. That's the key building
[15:24.640 --> 15:29.120]  block out of which models like this one, which is called GPT-3 and comes from OpenAI,
[15:29.700 --> 15:39.340]  an AI research lab, are built. Okay. So, moving on.
[15:39.900 --> 15:47.180]  So, I want to go back a little bit in the history of NLP as a way of describing how
[15:47.180 --> 15:54.040]  transformers are new and different. So, I assume a substantial chunk of our audience today
[15:54.640 --> 16:02.720]  has studied basic machine learning on text. And typically, in like a machine learning 101
[16:03.380 --> 16:10.200]  course, you learn about the bag of words model and about discrete-state Markov models of language.
[16:10.860 --> 16:15.600]  And I want to talk about these representations as a way of talking about how limited those
[16:15.600 --> 16:19.580]  representations were, and then I'll talk about how transformers break us free of some of those
[16:19.580 --> 16:25.960]  limitations. So, a bag of words model of a document is a way of representing a document
[16:25.960 --> 16:30.280]  numerically for the purposes of machine learning, in which we just count up how many times each
[16:30.280 --> 16:37.600]  word in that document appeared. And then we create a matrix out of all the documents we have,
[16:37.600 --> 16:43.000]  and all the words in all those documents. And the entries in that matrix are just word counts.
[16:43.000 --> 16:50.320]  So, in the case on the left over here, the column vectors in our matrix are documents,
[16:50.320 --> 16:56.160]  and they get one dimension per word in our vocabulary, and the entries of each vector are counts of how many
[16:56.160 --> 17:01.560]  times each word in that vocabulary appeared in that particular document.
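To make the bag of words representation just described concrete, here is a minimal sketch in plain Python. The function name `bag_of_words`, the whitespace tokenizer, and the two toy documents are illustrative assumptions, not anything shown in the talk; the sketch also stores one row per document, whereas the slide described documents as columns.

```python
from collections import Counter


def bag_of_words(documents):
    """Build a sorted vocabulary and one word-count vector per document.

    Hypothetical helper for illustration: tokenization is a plain
    lowercase whitespace split, which a real system would replace.
    """
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One dimension per vocabulary word; the entry is that word's count.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


# Two toy documents with the same words in a different order.
docs = ["the dog bit the man", "the man bit the dog"]
vocabulary, vectors = bag_of_words(docs)
```

Both toy documents map to the same count vector even though they mean different things, which is the loss of word-order information that this representation accepts.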
[17:02.840 --> 17:07.120]  And hopefully it's intuitive that, you know, once you've represented your documents in a
[17:07.120 --> 17:11.120]  vocabulary space in this way, you can compare documents by taking some distance measure
[17:11.120 --> 17:16.400]  between pairs of documents. You can also train machine learning models on your document corpus
[17:16.400 --> 17:24.520]  to, say, classify news articles as being about sports or politics. But what you've done in the first
[17:24.520 --> 17:32.140]  step of these models is drop out sequence information. So, you've forgotten about which
[17:32.140 --> 17:38.760]  words come in which order in the document, and you've just represented your document as a bag
[17:38.760 --> 17:43.820]  of words. Which is a useful simplifying assumption, and it's one that we still use
[17:43.820 --> 17:48.720]  today in some of the modeling we do in my research group at Sophos. But it throws out
[17:48.880 --> 17:52.820]  a ton of information that transformers and more modern models don't throw out.
[17:53.740 --> 17:58.460]  In the model on the right, we have a discrete-state Markov model of language.
[17:58.740 --> 18:05.160]  Here each word is a state, and the concept of language given by the model
[18:06.440 --> 18:12.560]  is kind of a choose-your-own-adventure story in which the next word in an utterance depends only on
[18:12.560 --> 18:17.620]  the current word, and you're just sort of drawing from a probability distribution and moving through
[18:17.620 --> 18:22.580]  this graph to generate language. You could never have generated anything close to that unicorn
[18:22.580 --> 18:29.120]  story using a Markov model, and yet as recently as the last five, ten years, there were lots of
[18:29.120 --> 18:34.840]  papers coming out on using hidden Markov models to parse sentences and that kind of thing.
[18:35.160 --> 18:39.060]  These are still useful models, but we've gone far beyond the simplifying assumptions
[18:39.520 --> 18:47.540]  in these original sort of simplified models of the world in NLP from the past few decades.
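The discrete-state Markov model described above can be sketched in a few lines. This is only an illustration under stated assumptions: the names `train_markov` and `generate`, the toy corpus, and the fixed random seed are all hypothetical, and the model is the simplest first-order variant in which each word is a state.

```python
import random
from collections import defaultdict


def train_markov(text):
    """Record, for each word, the list of words observed right after it."""
    words = text.split()
    transitions = defaultdict(list)
    for current, following in zip(words, words[1:]):
        transitions[current].append(following)
    return transitions


def generate(transitions, start, length, seed=0):
    """Walk the word graph: the next word depends only on the current word."""
    rng = random.Random(seed)
    word, output = start, [start]
    for _ in range(length - 1):
        choices = transitions.get(word)
        if not choices:  # dead end: this word was never followed by anything
            break
        # Sampling from the list of observed successors draws each next
        # word in proportion to how often it followed the current word.
        word = rng.choice(choices)
        output.append(word)
    return output


corpus = "the cat sat on the mat and the cat ran"
transitions = train_markov(corpus)
sample = generate(transitions, "the", 6)
```

Because each step looks only at the current word, the generator has no memory of anything earlier in the sentence, which is why it cannot sustain a coherent story.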
[18:49.500 --> 18:54.820]  So let's contrast transformers now, and I'll get into more details about how transformers
[18:54.820 --> 18:59.400]  work in a second, but let's contrast transformers with these earlier natural
[18:59.400 --> 19:06.320]  language processing approaches. So pre-transformer, most machine learning
[19:06.320 --> 19:14.620]  approaches didn't consider words in context. Many approaches made the simplifying bag of
[19:14.620 --> 19:22.980]  words assumptions as kind of a first step in the modeling process, and then ran term vectors
[19:22.980 --> 19:29.980]  through models like topic models, or logistic regression, or support vector machines.
[19:30.620 --> 19:39.240]  Most models didn't model kind of like co-reference relationships between words,
[19:39.240 --> 19:44.520]  or sort of which words pertain to which other words in a sentence, and I'll talk about what
[19:44.520 --> 19:51.140]  that means later, but transformers do sort of solve that problem. Most approaches
[19:53.320 --> 20:00.340]  operated on either whole words or individual characters. Typically, the way people use transformers,
[20:00.340 --> 20:06.600]  we use well-chosen chunks of words, which allows us to model misspellings and this kind of thing,
[20:06.600 --> 20:10.960]  so there's been an improvement there in the current generation of natural language processing models.
[20:12.060 --> 20:22.200]  And older approaches tended not to use neural network technology. That's changed a lot with the
[20:22.200 --> 20:29.400]  best ideas that have come out of the neural network revolution that's been ongoing since
[20:29.400 --> 20:38.980]  around 2012. So transformers kind of kick apart a number of logjams in natural language modeling.
[20:38.980 --> 20:44.320]  They give words contextual representations, they model attention, like the relationship between
[20:44.320 --> 20:50.260]  words in a document, they use these smart partial word representations that allow for
[20:50.260 --> 20:54.340]  misspellings and just the Tower of Babel of vernaculars that appear
[20:55.800 --> 21:00.500]  like under the banner of, say, the English language on the internet, and they take advantage
[21:00.500 --> 21:07.160]  of ideas like residual connections and modern optimizers and many of the really good ideas
[21:07.160 --> 21:11.980]  that have come out of the neural network revolution. So these are all reasons why
[21:11.980 --> 21:21.540]  we wanted to test their applicability to phishing detection. Okay, so if you want to get into detail
[21:21.540 --> 21:27.780]  about how transformers work, I'd recommend this blog post by Jay Alammar. It's where I cribbed this
[21:28.340 --> 21:34.360]  screenshot here. I'm not going to get into the details of the series of matrix multiplications,
[21:34.360 --> 21:41.140]  summations, and various linear algebra operations that comprise the
[21:41.140 --> 21:45.640]  transformer block. I just want to give a little bit of intuition here before passing the mic over
[21:45.640 --> 21:52.300]  to Younghoo. The basic idea behind a transformer block, which is kind of a Lego block,
[21:52.860 --> 21:59.540]  out of which you build a transformer-based neural network, is we're taking in a sequence of words.
[22:00.140 --> 22:04.060]  This diagram shows a very simple example where we're just taking a sequence of two words. Typically
[22:04.060 --> 22:11.400]  we'd take a larger window, like 512 words. We pass them into the block, and the way they
[22:11.400 --> 22:20.760]  get passed in is not just as entries in a term vector matrix, but actually as vectors themselves.
[22:20.760 --> 22:27.920]  The words get a vector representation. These are known as embeddings. We pass them both into the
[22:27.920 --> 22:35.440]  transformer network. The first thing that the transformer network does to our input word
[22:35.440 --> 22:45.520]  sequence is model the attention relationships between the words. In a two-word case, it's a
[22:45.520 --> 22:51.180]  little bit harder to describe here, but basically if you had a sequence of 15 words,
[22:51.180 --> 22:57.840]  for every word, the attention mechanism would compute how much
[22:57.840 --> 23:04.240]  that word pertains to the other 14 words in the sentence. You'll see how that has a
[23:04.240 --> 23:08.600]  relationship with the coreference resolution problem I talked about earlier. But it's just
[23:08.600 --> 23:14.560]  intuitive that there's a graph of word relationships in a sentence, and self-attention
[23:14.560 --> 23:21.820]  kind of models that in terms of which subjects pertain to which objects,
[23:21.820 --> 23:27.760]  which pronouns pertain to which people, etc. There's an addition and normalization step
[23:27.760 --> 23:33.680]  that happens when we've run this self-attention process a number of times
[23:34.920 --> 23:39.460]  and typically we don't just do self-attention once. We have a number of what are called heads and
[23:39.460 --> 23:44.520]  we run attention a number of times. We combine all that together, we do some non-linear transformation
[23:44.520 --> 23:49.800]  on it, and then we wind up with a new embedding of your original sequence of the same dimensionality
[23:49.800 --> 23:56.000]  as this original embedding, except that now "thinking" is encoded in this new representation
[23:56.000 --> 24:00.680]  in the context in which it appears, in the context of "machines." "Machines" is encoded in the context of
[24:00.680 --> 24:04.400]  "thinking." Then typically we stack these transformer blocks.
[24:04.400 --> 24:07.840]  We have another transformer block that then sort of refines the representation and we keep
[24:07.840 --> 24:12.500]  going. Younghoo will show that we use a number of transformer blocks in our phishing detection
[24:12.500 --> 24:19.240]  work in a few minutes. Here's some intuition as to what comes out of the attention mechanism
[24:20.220 --> 24:25.980]  in a typical transformer. So here's what a transformer block has decided the word "it"
[24:27.720 --> 24:31.440]  relates to in an input sentence. So here we have the input sentence,
[24:31.440 --> 24:34.880]  "The animal didn't cross the street because it was too tired."
[24:36.440 --> 24:43.700]  And the strongest attentional relationship here goes to the animal, which is interesting
[24:43.700 --> 24:49.140]  because one could ask whether or not "it" refers to the street or the animal here.
[24:49.780 --> 24:54.200]  One can interpret the weight of the connection to the animal as meaning that the transformer block
[24:54.960 --> 25:00.760]  has "decided," in air quotes, that "it" pertains to the animal, which is really interesting.
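The attention weights discussed above can be illustrated with a stripped-down sketch of scaled dot-product self-attention. This is a simplification under stated assumptions: it uses a single head and compares raw embeddings directly, whereas a real transformer block applies learned query, key, and value projections and runs several heads. The three toy token vectors are invented for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(embeddings):
    """Single-head self-attention without learned projections (a simplification).

    Returns the attention weight matrix and the new contextual vectors,
    which have the same dimensionality as the input embeddings.
    """
    dim = len(embeddings[0])
    weights, outputs = [], []
    for query in embeddings:
        # How strongly does this token pertain to every token in the sequence?
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in embeddings
        ]
        row = softmax(scores)
        weights.append(row)
        # New representation: a weighted mix of all token vectors.
        outputs.append(
            [sum(w * value[i] for w, value in zip(row, embeddings)) for i in range(dim)]
        )
    return weights, outputs


# Toy embeddings: tokens 0 and 1 point in similar directions, token 2 does not.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
weights, outputs = self_attention(tokens)
```

Each row of `weights` plays the role of the lines in the attention diagram: a larger entry means the model treats that pair of tokens as more closely related.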
[25:01.660 --> 25:05.160]  So hopefully you get some intuition as to how powerful this attentional representation is and
[25:05.160 --> 25:11.160]  how important it is to machine learning models getting what we might call understanding
[25:11.160 --> 25:18.410]  of the language that they're analyzing. Okay, so I want to put the intuition together
[25:18.410 --> 25:23.410]  around how transformers pertain to the work that Younghoo and I are presenting today.
[25:24.230 --> 25:31.350]  So basically, what we're going to do in our phishing model is embed an email as a sequence
[25:31.350 --> 25:39.370]  of embedded character sequence vectors. And then we're going to run that through a series of
[25:39.370 --> 25:46.350]  transformer blocks, like what I just showed, that are going to create a very refined attentional
[25:46.350 --> 25:54.790]  representation of the word sequences, and then produce these contextual embeddings that get at
[25:54.790 --> 26:00.490]  the meaning of the words in the context in which they appear. And then finally, our network is
[26:00.490 --> 26:05.790]  going to solve a classification task: say, whether or not the email is a phishing email.
[26:06.310 --> 26:10.190]  How all that magic works and how the network gets trained, where there are some tricks that
[26:10.190 --> 26:16.170]  are really cool, I'll leave to Younghoo. I hope a lot of this makes sense, and I'm happy to
[26:16.170 --> 26:20.550]  take questions about my piece of this presentation later at the end of the talk.
[26:21.970 --> 26:30.850]  Thank you, Josh. Let me continue with the second part of our talk. The second part will include
[26:31.570 --> 26:39.250]  our design decisions for CatBERT and performance results.
[26:39.830 --> 26:48.130]  CatBERT is the name of our email model: Context-Aware Tiny BERT. The model size is tiny,
[26:48.130 --> 27:01.580]  but it is a mighty BERT. Modern NLP models all have nice and friendly names. For example,
[27:02.150 --> 27:11.120]  ELMo was introduced in 2018. The model used bidirectional LSTMs to
[27:11.650 --> 27:16.440]  generate contextualized word embeddings.
[27:17.650 --> 27:27.980]  And then, later the same year, Google researchers introduced BERT and achieved state-of-the-art
[27:27.980 --> 27:37.120]  performance in many English language understanding problems. The next year, 2019, Baidu researchers
[27:37.120 --> 27:46.660]  introduced ERNIE, and the model achieved another state-of-the-art performance in many
[27:46.660 --> 27:56.220]  Chinese language understanding problems. They are all named after popular characters from Sesame Street.
[27:57.180 --> 28:06.880]  This year, 2020, we introduced CatBERT to tackle email security problems.
[28:12.010 --> 28:20.270]  Transformer-based NLP models are powerful, but they are complex and heavy.
[28:20.790 --> 28:29.050]  And it is challenging to deploy heavy models for real-time applications.
[28:29.670 --> 28:38.770]  So our first design goal is to convert the heavy model into a lightweight model,
[28:38.770 --> 28:45.810]  so we can reduce the number of parameters and then we can improve inference speed.
[28:46.450 --> 28:53.890]  We downsized a baseline model called DistilBERT, which has six transformer blocks.
[28:54.590 --> 29:00.650]  We take half of the transformer blocks from the pre-trained model and then replace the missing
[29:00.650 --> 29:10.110]  transformers with simple adapters. For example, here we take transformers 1, 3, and 5,
[29:10.110 --> 29:18.510]  and then we add two adapters. We can also take a different number of transformer blocks.
[29:19.320 --> 29:24.840]  This approach significantly reduces the number of parameters.
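The downsizing step can be pictured as a layer plan. The sketch below only builds a list of layer names; `downsized_stack` and the naming scheme are hypothetical, and placing one adapter between each pair of consecutive kept blocks is an assumption drawn from the description of two adapters standing in for the removed blocks of the six-block baseline.

```python
def downsized_stack(num_blocks=6, keep=(1, 3, 5)):
    """Return layer names for a downsized model (illustrative only).

    Keeps the listed pretrained transformer blocks and, as an assumption,
    inserts one adapter between each pair of consecutive kept blocks.
    """
    layers = []
    for position, block in enumerate(keep):
        if not 1 <= block <= num_blocks:
            raise ValueError(f"block {block} is outside 1..{num_blocks}")
        if position > 0:
            layers.append(f"adapter_{position}")
        layers.append(f"transformer_{block}")
    return layers


plan = downsized_stack()
```

Keeping a different subset of blocks, for example `keep=(2, 4)`, yields a smaller plan with a single adapter.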
[29:29.940 --> 29:39.020]  The second goal is to improve the model performance by incorporating additional input.
[29:40.420 --> 29:50.160]  Standard NLP models only accept text data as input. However, we can extract additional
[29:50.750 --> 29:58.420]  features from email headers and use them as additional input to our model.
[29:58.900 --> 30:07.660]  So the text input will be the input to the embedding layer, and then we add the additional input
[30:07.660 --> 30:13.100]  to the classification head. We added additional dense layers in the classification head
[30:13.100 --> 30:22.160]  to combine the output of the transformer blocks with the header-related input.
[30:22.680 --> 30:28.880]  With the additional input, we improved our model's performance further.
[30:34.010 --> 30:38.750]  Let me talk about the details of our adapters.
[30:40.350 --> 30:48.310]  We inserted two adapters here, and each adapter has quite a simple architecture.
[30:49.050 --> 30:57.490]  Each adapter has two dense layers with one non-linear activation unit in between,
[30:57.490 --> 31:01.710]  and we have a skip connection.
[31:02.930 --> 31:11.390]  The dimensionality of each dense layer is the same as the output of the transformer blocks.
[31:11.690 --> 31:20.450]  The two dense layers are initialized with near-zero values. So initially, the adapters
[31:20.450 --> 31:32.450]  act as identity blocks. However, they gradually learn to transform the data passed from the lower
[31:32.450 --> 31:41.130]  transformer block to the upper transformer block so as to minimize the classification loss.
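The adapter described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the talk does not name the activation function, so ReLU is assumed, and the width of 768, the initialization scale, and the class name `Adapter` are all hypothetical choices.

```python
import numpy as np


class Adapter:
    """Two dense layers, one non-linearity in between, plus a skip connection."""

    def __init__(self, dim, scale=1e-3, seed=0):
        rng = np.random.default_rng(seed)
        # Near-zero initialization, so the block starts out close to identity.
        self.w1 = rng.normal(0.0, scale, size=(dim, dim))
        self.b1 = np.zeros(dim)
        self.w2 = rng.normal(0.0, scale, size=(dim, dim))
        self.b2 = np.zeros(dim)

    def __call__(self, x):
        hidden = np.maximum(x @ self.w1 + self.b1, 0.0)  # ReLU (assumed)
        return x + hidden @ self.w2 + self.b2            # skip connection


x = np.random.default_rng(1).normal(size=(4, 768))  # 4 token vectors
adapter = Adapter(768)
y = adapter(x)
max_dev = float(np.max(np.abs(y - x)))
print(y.shape, max_dev)
```

Because both weight matrices start near zero, `max_dev` is tiny, which is the "identity block at initialization" behavior; training would then move the weights away from zero.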
[31:45.580 --> 31:53.380]  We also modified the standard fine-tuning method by using a partial fine-tuning method.
[31:54.800 --> 32:02.740]  Standard fine-tuning involves updating all parameters jointly. However, for partial
[32:02.740 --> 32:11.700]  fine-tuning, we only update the upper blocks and keep the lower blocks fixed. For example, here,
[32:11.700 --> 32:21.100]  the lower blocks, the embedding and transformers 1 and 3, are fixed, but we update adapters 1 and 2,
[32:21.100 --> 32:31.240]  transformer 5, and the classification head. This approach is meant to minimize the forgetting
[32:31.240 --> 32:39.400]  of learned representations in the lower transformer blocks.
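The split between fixed and updated components can be written down as a small table of trainable flags. The sketch below is framework-free and uses hypothetical component names; in a framework such as PyTorch, the frozen entries would correspond to parameters with gradients disabled.

```python
# Hypothetical component names, in bottom-to-top order.
COMPONENTS = [
    "embedding",
    "transformer_1",
    "adapter_1",
    "transformer_3",
    "adapter_2",
    "transformer_5",
    "classification_head",
]

# Lower blocks that stay fixed during partial fine-tuning, as in the talk.
FROZEN = {"embedding", "transformer_1", "transformer_3"}


def trainable_flags(components, frozen):
    """Map each component name to True if it is updated during fine-tuning."""
    return {name: name not in frozen for name in components}


flags = trainable_flags(COMPONENTS, FROZEN)
updated = [name for name, is_trainable in flags.items() if is_trainable]
print(updated)
```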
[32:42.230 --> 32:50.430]  As mentioned earlier, we have two sets of features: one from text data and another from
[32:51.290 --> 33:00.410]  email headers. We can use multiple email header fields, for example the From, To, CC, and Reply-To
[33:01.370 --> 33:11.670]  fields, to extract additional context information. And we treat the subject and the body text as the
[33:12.170 --> 33:25.030]  content input. The first set of features is the email text content features.
[33:25.270 --> 33:35.050]  We extract text data from the subject and the plain-text body. If only HTML content is available,
[33:35.050 --> 33:43.650]  we can also extract plain text from the HTML data using an HTML parser. For example,
[33:43.650 --> 33:57.130]  this one is a simple HTML hyperlink, but we extract only the visible text, "visit bank site", as output.
[33:57.910 --> 34:04.150]  Then, from the extracted text, we remove less informative characters.
[34:04.870 --> 34:10.770]  We remove digit and punctuation characters. In the second example, there are many
[34:10.770 --> 34:21.030]  hyphens, and the hyphens are removed in this step. Otherwise, each hyphen would become an
[34:21.810 --> 34:33.230]  individual token. Finally, we select 120 tokens as input to the transformer blocks.
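The text preparation steps above can be sketched with the Python standard library alone. The helper names are hypothetical, the example URL is made up, and the sketch assumes that "select 120 tokens" means keeping the first 120; the talk does not say which tokens are selected. Note that this simple parser also keeps the contents of script and style elements, which a production extractor would likely drop.

```python
import string
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes and ignore tags and attributes such as href."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


# Replace every digit and punctuation character with a space.
_DROP = str.maketrans({c: " " for c in string.digits + string.punctuation})


def clean_text(text):
    return " ".join(text.translate(_DROP).split())


def select_tokens(tokens, limit=120):
    # Assumption: keep the first `limit` tokens.
    return tokens[:limit]


link = '<a href="http://example.test/login">visit bank site</a>'
visible = html_to_text(link)
cleaned = clean_text("Invoice 2020 ----- pay now!")
print(visible)
print(cleaned)
```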
[34:37.290 --> 34:44.870]  We use a sub-word tokenizer called WordPiece. The WordPiece tokenizer can overcome
[34:44.870 --> 34:53.950]  some of the limitations of character-level or word-level tokens. Character-level tokens are
[34:53.950 --> 35:00.190]  too fine-grained, so it is hard to recognize word boundaries and the meaning of words.
[35:00.190 --> 35:06.370]  And word-level tokens often have out-of-vocabulary problems.
[35:07.320 --> 35:16.090]  The sub-word tokenizer can split complex or uncommon words into sub-tokens.
[35:16.470 --> 35:27.090]  For example, CATBERT can be divided into CAT and BERT, and the double hash is an indicator
[35:27.090 --> 35:39.330]  for sub-words. Similarly, SOPHOS can be divided into three tokens: SO, PH, and OS.
[35:40.630 --> 35:47.470]  The sub-word tokenizer reduces the number of unknown tokens in our email data.
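To make the splitting concrete, here is a minimal sketch of the greedy longest-match-first rule commonly used by WordPiece tokenizers at inference time. The toy vocabulary is invented for this example; a real BERT vocabulary has tens of thousands of entries.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercase word into sub-word pieces.

    Pieces that continue a word carry the "##" prefix.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, assumed for illustration only.
TOY_VOCAB = {"cat", "##bert", "so", "##ph", "##os", "bank"}

catbert_pieces = wordpiece("catbert", TOY_VOCAB)
sophos_pieces = wordpiece("sophos", TOY_VOCAB)
print(catbert_pieces, sophos_pieces)
```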
[35:51.830 --> 35:59.150]  The selected tokens are the input to the embedding layer,
[35:59.150 --> 36:10.450]  and then we have three transformer blocks. Each transformer block has a multi-head
[36:10.450 --> 36:19.770]  attention layer with 12 heads, and each individual head learns contextual relationships between tokens.
[36:20.890 --> 36:24.270]  And the right-hand diagram shows attention weights between
[36:24.830 --> 36:34.230]  tokens. The "transfer" token has multiple attention weights for other tokens.
[36:37.240 --> 36:47.480]  In our email data, we have many non-English emails, and non-English emails can include
[36:47.480 --> 36:56.340]  English words. Also, English emails can include non-English words.
[36:57.360 --> 37:06.040]  Non-English emails account for 25% of our total benign and malicious emails.
[37:06.040 --> 37:18.240]  So we needed a multilingual model that can recognize different languages.
[37:21.170 --> 37:29.030]  How can we support multilingual emails? The solution is that BERT comes in two versions,
[37:29.030 --> 37:36.490]  English and multilingual BERT. The English version was pre-trained on English text
[37:36.490 --> 37:44.350]  datasets, including Wikipedia, and has 30,000 English tokens.
[37:46.210 --> 37:54.370]  The multilingual version was pre-trained on large text datasets from more than 100 languages.
[37:54.970 --> 38:00.670]  And this version has a four times larger vocabulary, 120,000 tokens,
[38:00.670 --> 38:05.790]  which covers many Unicode characters and words.
[38:06.490 --> 38:14.890]  So we fine-tuned a multilingual BERT for our multilingual emails.
[38:18.270 --> 38:26.190]  The second set of features is from email headers. We can extract multiple indicators from
[38:27.170 --> 38:34.970]  email header fields. For example, with the first one, we check whether the email is
[38:34.970 --> 38:45.090]  internal or external. We compare the domains of the recipient and the sender. For example, here,
[38:45.580 --> 38:52.750]  the domain name is a similar-looking one, but there are actually extra characters.
[38:52.750 --> 39:02.330]  So we consider this one an external email. We can also use an external-reply indicator
[39:02.990 --> 39:12.230]  by comparing the domains of From and Reply-To. Often, targeted phishing attackers use
[39:13.050 --> 39:20.090]  a different domain for Reply-To. We also collect the number of recipients and the number
[39:20.090 --> 39:27.710]  of carbon copy recipients as additional indicators. It is obvious that targeted
[39:27.710 --> 39:35.210]  phishing attacks will have only a single recipient. However, many ordinary
[39:37.110 --> 39:45.090]  phishing emails will have multiple recipients or carbon copy recipients.
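The four header indicators described above can be approximated with Python's standard `email.utils` helpers. The feature names, the `header_features` function, and the example addresses are hypothetical; the exact feature definitions used in the talk's model are not given.

```python
from email.utils import getaddresses, parseaddr


def domain(address):
    """Return the lowercase domain of a single address header value."""
    return parseaddr(address)[1].rpartition("@")[2].lower()


def header_features(headers):
    """Compute simple context indicators from a dict of header values."""
    sender = domain(headers.get("From", ""))
    to_addrs = [a for _, a in getaddresses([headers.get("To", "")]) if a]
    cc_addrs = [a for _, a in getaddresses([headers.get("Cc", "")]) if a]
    recipient_domains = {a.rpartition("@")[2].lower() for a in to_addrs}
    reply_to = headers.get("Reply-To")
    return {
        # Sender domain does not match any recipient domain.
        "is_external": int(sender not in recipient_domains),
        # Reply-To is present and points at a different domain than From.
        "external_reply": int(reply_to is not None and domain(reply_to) != sender),
        "num_to": len(to_addrs),
        "num_cc": len(cc_addrs),
    }


suspicious = header_features({
    "From": "CEO <ceo@examples.test>",  # look-alike of example.test
    "To": "amy@example.test",
    "Reply-To": "ceo@mail.test",
})
internal = header_features({
    "From": "bob@example.test",
    "To": "amy@example.test, joe@example.test",
    "Cc": "kim@example.test",
})
print(suspicious)
print(internal)
```

An exact string comparison is what makes the look-alike domain count as external here: `examples.test` differs from `example.test` by one character.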
[39:50.280 --> 39:55.100]  Next, we will have a look at the performance of our CATBERT.
[40:00.640 --> 40:04.660]  Let's have a look at the performance of CATBERT.
[40:06.900 --> 40:12.500]  We used a dataset of 10 million benign samples.
[40:12.720 --> 40:22.180]  And the dataset also includes 350 phishing emails and 1,000 BEC emails.
[40:23.740 --> 40:33.220]  We used a time split to allocate 70% of the samples for training and the remaining 30% for testing.
[40:34.010 --> 40:44.140]  We included two baseline models to compare against CATBERT. The first one is DistilBERT, which has
[40:44.140 --> 40:53.280]  six transformer blocks. The other one is an LSTM, long short-term memory. The model is a recurrent neural
[40:53.280 --> 41:05.140]  network architecture, which also uses the same embedding layer as BERT. We trained the three models on a
[41:05.140 --> 41:12.920]  GPU instance from AWS. We assigned a high sample weight to the
[41:12.920 --> 41:19.380]  BEC samples and oversampled the minority-class malicious samples to get balanced
[41:20.120 --> 41:29.240]  samples in each mini-batch. To compare performance, we use ROC curves and area under the curve.
[41:29.240 --> 41:36.240]  Also, we compare inference speed and the model size as key performance metrics.
[41:40.280 --> 41:51.780]  These ROC curves compare our CATBERT model with the two baseline models. The top blue one is CATBERT,
[41:51.780 --> 41:58.140]  the second one is DistilBERT, and the bottom green one is the LSTM.
[41:58.380 --> 42:03.140]  Our model outperformed the two baseline models.
[42:04.340 --> 42:14.600]  And our model achieved a 0.82 true positive rate at a 0.1% false positive rate.
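A figure such as "true positive rate at a fixed false positive rate" is one point read off the ROC curve. The sketch below shows one way to compute it from raw scores; the function name and the tiny score lists are made up for illustration and are not the talk's data.

```python
def tpr_at_fpr(scores, labels, max_fpr):
    """True positive rate at the loosest threshold whose FPR stays <= max_fpr.

    labels use 1 for malicious and 0 for benign; higher score = more malicious.
    """
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    positives = [s for s, y in zip(scores, labels) if y == 1]
    allowed = int(max_fpr * len(negatives))  # false positives we may tolerate
    # Flag samples scoring strictly above this benign score, so at most
    # `allowed` benign samples are flagged.
    threshold = negatives[allowed] if allowed < len(negatives) else float("-inf")
    detected = sum(1 for s in positives if s > threshold)
    return detected / len(positives)


benign_scores = [0.10, 0.20, 0.30, 0.40, 0.05, 0.15, 0.25, 0.35, 0.45, 0.90]
malicious_scores = [0.95, 0.80, 0.50, 0.30]
scores = benign_scores + malicious_scores
labels = [0] * len(benign_scores) + [1] * len(malicious_scores)

rate = tpr_at_fpr(scores, labels, max_fpr=0.1)
print(rate)
```

With ten benign samples and a 10% budget, one false positive is tolerated, so the threshold sits at the second-highest benign score (0.45) and three of the four malicious samples are detected.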
[42:17.410 --> 42:27.610]  Next, we compare the performance when we remove the adapters and the context-input-related layers.
[42:28.110 --> 42:37.710]  The top one is CATBERT, and the second, orange one is when we removed the adapters. And the bottom one
[42:38.060 --> 42:44.550]  is when the context input was removed from CATBERT.
[42:45.010 --> 42:53.530]  We can see a significant performance drop when we remove either the adapters or the context input,
[42:53.530 --> 43:03.990]  which demonstrates that we can improve performance by using additional adapters and context-related layers.
[43:07.760 --> 43:18.010]  Next, we compare the performance of the three models on targeted BEC samples.
[43:18.010 --> 43:25.450]  We assigned a high sample weight to BEC samples, and we achieved high performance in detecting
[43:25.450 --> 43:33.050]  those BEC samples, with CATBERT above the two baseline models.
[43:35.460 --> 43:44.420]  Next, we compare performance for phishing emails. We divide the phishing emails into two groups,
[43:44.420 --> 43:54.760]  English and non-English emails. Our CATBERT outperformed the baselines on both English and non-English emails.
[43:56.000 --> 44:03.720]  Our model was based on multilingual BERT, so we can see a significant performance
[44:03.720 --> 44:14.970]  improvement when we use the model for detecting non-English emails, compared with the simple LSTM model.
[44:16.570 --> 44:24.330]  Next, we compare the inference speed. DistilBERT has six transformer blocks, and CATBERT has
[44:24.330 --> 44:33.210]  three transformer blocks, so we achieve a two times speedup in inference time when we measure the
[44:33.210 --> 44:46.090]  performance on a CPU machine. As the number of blocks decreases, the inference time is reduced.
[44:49.560 --> 44:51.900]  Next, model size.
[44:53.820 --> 44:57.000]  For comparison, we divide the model size into two
[44:58.400 --> 45:03.180]  parts, embedding and transformer blocks.
[45:06.190 --> 45:16.990]  DistilBERT has six transformers and has 92 million parameters for the embedding and 42 million
[45:16.990 --> 45:25.010]  parameters for the transformer blocks. DistilBERT has six transformer blocks and CATBERT has three.
[45:25.010 --> 45:31.550]  We reuse the same embedding parameters, but we reduce the number of parameters for the
[45:31.550 --> 45:40.070]  transformer blocks by 50%. In total, our model size is 85% of the baseline model.
[45:40.650 --> 45:44.390]  When we apply the same mechanism to the English version,
[45:45.070 --> 45:53.770]  the English CATBERT will have 71% of the parameters of the baseline model.
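The multilingual size figure can be roughly checked from the numbers quoted in the talk. The arithmetic below counts only the embedding and the transformer blocks; it leaves out the adapters and the classification head, whose sizes are not stated, which may explain the small gap to the quoted 85%.

```python
# Parameter counts in millions, as quoted in the talk (multilingual models).
EMBEDDING = 92
TRANSFORMER_BLOCKS_BASELINE = 42  # six blocks in the baseline

baseline_total = EMBEDDING + TRANSFORMER_BLOCKS_BASELINE

# The downsized model keeps the embedding and half of the transformer blocks.
# Adapters and the classification head are not counted here.
downsized_total = EMBEDDING + TRANSFORMER_BLOCKS_BASELINE // 2

ratio = downsized_total / baseline_total
print(baseline_total, downsized_total, round(ratio, 3))
```

This gives 113 million against 134 million, a ratio of about 0.84. The embedding dominates the multilingual model, which is why halving the transformer blocks removes only around 15% of the total.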
[45:58.490 --> 46:05.490]  Next, we will inspect how CATBERT generates its outputs.
[46:08.240 --> 46:19.280]  We use the LIME method to interpret our predictions. LIME is a
[46:19.280 --> 46:26.860]  local interpretable model-agnostic explanation method which can be applied to any black-box
[46:26.860 --> 46:35.140]  model. We can understand a model by perturbing the input and observing how the predictions
[46:35.140 --> 46:50.250]  change. The first LIME example is for a benign email. The prediction score for this email
[46:51.010 --> 46:57.530]  is close to zero. We highlight legitimate tokens in
[46:59.370 --> 47:06.190]  blue and malicious ones in orange. In this one, we don't have any highly weighted
[47:06.190 --> 47:17.500]  tokens for maliciousness. Next, we have a BEC sample. The model prediction score for
[47:17.500 --> 47:28.860]  maliciousness is close to one. The model recognizes "transfer" and "urgent payment" as
[47:28.860 --> 47:42.170]  highly weighted tokens. Next, we have another BEC sample, which is related to gift cards.
[47:42.170 --> 47:48.970]  The model prediction score is close to one, and "card" and "urgently"
[47:50.110 --> 48:03.710]  are the highly weighted tokens for this email. Next, we have two handcrafted social engineering emails.
[48:03.710 --> 48:12.570]  They look quite different, but if you read the text carefully, they are asking for the same wire transfer.
[48:13.450 --> 48:22.850]  The model predictions are close to one for both emails, and the highlighted tokens are "payment"
[48:24.050 --> 48:36.470]  and "swift" or "as soon as possible". These examples demonstrate our model's ability to
[48:37.190 --> 48:45.390]  understand complex texts and identify conceptually similar emails.
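The idea behind these token highlights, perturbing the input and watching the score move, can be shown with a much simpler stand-in than LIME itself. The sketch below removes one token at a time and records the score drop; real LIME instead samples many random perturbations and fits a local linear surrogate model. The `toy_score` function and its weights are invented placeholders for a trained classifier.

```python
def toy_score(tokens):
    """Hypothetical stand-in for the model's maliciousness score in [0, 1]."""
    weights = {"urgent": 0.4, "payment": 0.3, "transfer": 0.2}
    return min(1.0, sum(weights.get(t, 0.0) for t in tokens))


def token_importance(tokens, score_fn):
    """Score drop caused by deleting each token (leave-one-out occlusion)."""
    base = score_fn(tokens)
    return [
        (token, base - score_fn(tokens[:i] + tokens[i + 1:]))
        for i, token in enumerate(tokens)
    ]


tokens = "please send urgent payment today".split()
importance = token_importance(tokens, toy_score)
top_token = max(importance, key=lambda pair: pair[1])[0]
print(importance)
print(top_token)
```

Tokens whose removal lowers the score the most would be the ones highlighted as malicious in a visualization like the one shown in the talk.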
[48:48.830 --> 48:57.150]  In conclusion, our CATBERT is a carefully re-architected transformer-based model.
[48:57.150 --> 49:03.950]  With this architecture, we achieve both high speed and high accuracy in detecting
[49:05.010 --> 49:11.290]  handcrafted social engineering email attacks. In the future, we want to apply the same
[49:11.290 --> 49:21.850]  design decisions to the new GPT-3 model. Thank you. Do you have any questions?
[49:25.180 --> 49:29.600]  If you do have any questions, please head over to the Discord channel.
[49:30.640 --> 49:34.980]  They are currently there answering questions in the DEF CON Discord channel
[49:34.980 --> 49:40.640]  under aiv-talks-text.
